ETLMR: A Highly Scalable Dimensional ETL Framework Based on MapReduce

نویسندگان

  • Xiufeng Liu
  • Christian Thomsen
  • Torben Bach Pedersen
چکیده

Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is processing huge volumes of data quickly. MapReduce is establishing itself as the de-facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL specific constructs, resulting in low ETL programmer productivity. This paper presents a scalable dimensional ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few code lines. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The paper describes the integration of ETLMR with a MapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

MapReduce-based Dimensional ETL Made Easy

This paper demonstrates ETLMR, a novel dimensional Extract– Transform–Load (ETL) programming framework that uses MapReduce to achieve scalability. ETLMR has built-in native support of data warehouse (DW) specific constructs such as star schemas, snowflake schemas, and slowly changing dimensions (SCDs). This makes it possible to build MapReduce-based dimensional ETL flows very easily. The ETL pr...

متن کامل

CloudETL: Scalable Dimensional ETL for Hadoop and Hive

Extract-Transform-Load (ETL) programs process data from sources into data warehouses (DWs). Due to the rapid growth of data volumes, there is an increasing demand for systems that can scale on demand. Recently, much attention has been given to MapReduce which is a framework for highly parallel handling of massive data sets in cloud environments. The MapReduce-based Hive has been proposed as a D...

متن کامل

GraphBuilder: A Scalable Graph ETL Framework

Graph abstraction is essential for many applications, from finding a shortest path to executing complex machine learning (ML) algorithms like collaborative filtering. However, constructing graphs from relationships hidden within large unstructured datasets is challenging. Since graph construction is a data-parallel problem, MapReduce is well-suited for this task. We developed GraphBuilder, an o...

متن کامل

A Scalable XSLT Processing Framework based on MapReduce

The eXtensible Stylesheet Language Transformation (XSLT) is a de-facto standard for XML data transforming and extracting. Efficient processing of large amounts of XML data brings challenges to conventional XSLT processors, which are designed to run in a single machine context. To solve these data-intensive problems, MapReduce paradigm in the cloud computing domain has received a comprehensive a...

متن کامل

Big-ETL: Extracting-Transforming-Loading Approach for Big Data

ETL process (Extracting-Transforming-Loading) is responsible for (E)xtracting data from heterogeneous sources, (T)ransforming and finally (L)oading them into a data warehouse (DW). Nowadays, Internet and Web 2.0 are generating data at an increasing rate, and therefore put the information systems (IS) face to the challenge of big data. Data integration systems and ETL, in particular, should be r...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011